
GPU improvements #66

Closed · wants to merge 30 commits

Conversation

jedwards4b (Collaborator)

Remove special definitions for GPU-enabled compilers. WIP

@jedwards4b jedwards4b self-assigned this Nov 3, 2022
@jedwards4b jedwards4b marked this pull request as draft November 3, 2022 14:44
@fischer-ncar fischer-ncar self-requested a review November 3, 2022 15:41
message("GPU_TYPE is ${GPU_TYPE} GPU_OFFLOAD is ${GPU_OFFLOAD}")
if (GPU_TYPE STREQUAL v100 AND GPU_OFFLOAD STREQUAL OpenACC)
string(APPEND GPUFLAGS " -acc -gpu=cc70,lineinfo,nofma -Minfo=accel ")
endif()
Collaborator:

Can we issue an error message and exit the compilation for any configuration other than V100 GPU with OpenACC? That way we won't accidentally compile the code with undesired flags.

Collaborator:

Also, we could add the A100 GPU and OpenACC configuration here with the flags -acc -gpu=cc80,lineinfo,nofma -Minfo=accel.

Collaborator Author:

I disagree with adding an error message and exiting: we want users to be able to expand the list of supported compilers, machines, and offload options. I can look into adding a warning.

Yes, we can add the A100 options.
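Taken together, the discussion above could look something like the following cmake_macros fragment. This is a sketch: the A100 flags come from the suggestion earlier in the thread, and the non-fatal warning branch is an assumption about how the proposed "warn, don't error" check might be written.

```cmake
message("GPU_TYPE is ${GPU_TYPE} GPU_OFFLOAD is ${GPU_OFFLOAD}")
if (GPU_TYPE STREQUAL v100 AND GPU_OFFLOAD STREQUAL OpenACC)
  string(APPEND GPUFLAGS " -acc -gpu=cc70,lineinfo,nofma -Minfo=accel ")
elseif (GPU_TYPE STREQUAL a100 AND GPU_OFFLOAD STREQUAL OpenACC)
  string(APPEND GPUFLAGS " -acc -gpu=cc80,lineinfo,nofma -Minfo=accel ")
elseif (NOT GPU_OFFLOAD STREQUAL none)
  # Warn rather than error, so users can still extend the list of
  # supported compilers, machines, and offload options themselves.
  message(WARNING "No GPU flags defined for GPU_TYPE=${GPU_TYPE} GPU_OFFLOAD=${GPU_OFFLOAD}")
endif()
```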

Collaborator:

So my concern is that if we specify V100 GPU and OpenMP offload here, the code will still compile but no GPU flags will actually be applied, which is confusing to me. By adding an error message, we can force a user to set the correct options before the code can be compiled. What do you think?

Collaborator Author:

I think that I can enforce this by making the MAX_GPUS_PER_NODE variable in config_machines.xml compiler dependent. How does that sound?

Collaborator:

Oh, I did not realize that you had provided a full list of combinations for OpenACC and OpenMP offload. In that case, my concern is probably unnecessary.

</directives>

<directives queue="casper" compiler="nvhpc-gpu">
<directives queue="casper" compiler="nvhpc">
<!-- Turn on MPS server manually -->
<!-- This is a temporary solution and should be removed once MPS is integrated into PBS on Casper -->
<directive default="/bin/bash" > -S /glade/u/apps/dav/opt/nvidia-mps/mps_bash </directive>
Collaborator:

CISL has integrated MPS into the PBS scheduler, so we can turn on MPS by adding mps=1 to the PBS resource -l select=...., which is the next line. This temporary solution is no longer needed.
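For illustration, the MPS request could then ride on the select statement itself. This is a sketch: only the mps=1 resource is taken from the comment above, and the chunk counts (ncpus, mpiprocs, ngpus) are placeholders.

```
#PBS -l select=1:ncpus=36:mpiprocs=36:ngpus=4:mps=1
```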

Collaborator Author:

Can mps=1 be added to the nvhpc select line regardless of whether GPUs are actually used, or should it only be there for GPU cases?

Collaborator:

Great question, and I do not know. For safety, can we add the ngpus and mps options only when the GPU_OFFLOAD method is not none?

Collaborator:

And we have the ngpus_per_node option for the create_newcase script. That could also be used to determine whether a job should be submitted to a GPU queue, and which PBS resources should be selected later.

@@ -478,15 +479,17 @@ This allows using a different mpirun command to launch unit tests
<command name="load">openmpi/4.1.0</command>
<command name="load">netcdf-mpi/4.7.4</command>
<command name="load">pnetcdf/1.12.2</command>
</modules>
<modules mpilib="openmpi" compiler="nvhpc" gpu_offload="OpenACC">
<command name="load">cuda/11.0.3</command>
Collaborator:

I later realized that the nvhpc module already comes with CUDA, so we probably do not need to load CUDA manually unless the newer CUDA version has a problem. It may be worth testing.

Collaborator Author:

Already tested: on Casper the cuda module seems to be required even though it shouldn't be.

Collaborator:

Aha, I see. In that case, can we upgrade the CUDA version to 11.6, since that is the version that comes with nvhpc/22.2?

<command name="load">netcdf-mpi/4.8.1</command>
<command name="load">pnetcdf/1.12.2</command>
<command name="load">pnetcdf/1.12.3</command>
</modules>
<modules mpilib="mpi-serial" compiler="nvhpc">
<command name="load">netcdf/4.8.1</command>
Collaborator:

Shall we delete the lines 497 to 505?

<modules compiler="nvhpc" mpilib="openmpi" DEBUG="FALSE">
<command name="use">/glade/p/cesmdata/cseg/PROGS/modulefiles/esmfpkgs/nvhpc/22.2/</command>
<command name="load">esmf-8.4.0b08_casper-ncdfio-openmpi-O</command>
</modules>
<modules compiler="pgi" mpilib="openmpi" DEBUG="TRUE">
Collaborator:

Shall we delete the PGI related settings for Casper now?

@@ -417,6 +417,7 @@ This allows using a different mpirun command to launch unit tests
<DIN_LOC_ROOT_CLMFORC>/glade/p/cgd/tss/CTSM_datm_forcing_data</DIN_LOC_ROOT_CLMFORC>
<DOUT_S_ROOT>$CIME_OUTPUT_ROOT/archive/$CASE</DOUT_S_ROOT>
<BASELINE_ROOT>$ENV{CESMDATAROOT}/cesm_baselines</BASELINE_ROOT>
<CCSM_CPRNC>$ENV{CESMDATAROOT}/tools/cime/tools/cprnc/cprnc</CCSM_CPRNC>
Collaborator:

Remove pgi and *-gpu at line 413.

Collaborator Author:

I know I haven't completed the cleanup here yet.

if (GPU_TYPE STREQUAL a100 AND GPU_OFFLOAD STREQUAL OpenACC)
string(APPEND GPUFLAGS " -acc -gpu=cc80,lineinfo,nofma -Minfo=accel ")
endif()
if (GPU_TYPE STREQUAL a100 AND GPU_OFFLOAD STREQUAL OpenACC)
Collaborator:

Replace OpenACC with OpenMP.

jedwards4b and others added 21 commits November 3, 2022 10:11
Use nvhpc/22.11, cray-mpich/8.1.21, and esmf/8.4.1.b02

	modified:   machines/cmake_macros/gust.cmake
	modified:   machines/config_machines.xml
…_config_cesm/compare/52c06b3..fbc05d6

Update settings on Gust and Casper

	modified:   Depends.nvhpc
	deleted:    Depends.nvhpc-gpu
	deleted:    cmake_macros/nvhpc-gpu.cmake
	deleted:    cmake_macros/nvhpc-gpu_casper.cmake
	modified:   cmake_macros/nvhpc.cmake
	modified:   cmake_macros/nvhpc_casper.cmake
	deleted:    cmake_macros/pgi-gpu.cmake
	deleted:    cmake_macros/pgi-gpu_casper.cmake
	modified:   config_batch.xml
	modified:   config_machines.xml
	deleted:    mpi_run_gpu.casper
…asper and Gust

	modified:   machines/config_machines.xml
	modified:   machines/config_machines.xml
	modified:   machines/config_machines.xml
	modified:   machines/config_machines.xml
load cuda module on Gust for a GPU run

	deleted:    cmake_macros/nvhpc_gust.cmake
	modified:   config_machines.xml
	modified:   machines/config_machines.xml
	modified:   config_batch.xml
	modified:   machines/config_batch.xml
	modified:   machines/config_machines.xml
	modified:   config_machines.xml
@jedwards4b jedwards4b marked this pull request as ready for review May 26, 2023 20:04
@jedwards4b jedwards4b requested a review from sjsprecious May 26, 2023 20:04
@sjsprecious (Collaborator) left a comment:

Thanks @jedwards4b for opening this PR. It generally looks good to me, but I have some clarifications / change requests before approving it.

<modules mpilib="openmpi" compiler="nvhpc">
<command name="load">cuda/11.6</command>
Collaborator:

Shall we only load the CUDA module when gpu_offload is not none, or skip it when gpu_enabled is false?
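One possible shape for that, assuming CIME's `!` attribute negation applies to gpu_offload (an assumption, not verified here):

```xml
<!-- Sketch: load CUDA only when a GPU offload method is requested -->
<modules mpilib="openmpi" compiler="nvhpc" gpu_offload="!none">
  <command name="load">cuda/11.6</command>
</modules>
```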

<MAX_MPITASKS_PER_NODE>36</MAX_MPITASKS_PER_NODE>
<MAX_CPUTASKS_PER_GPU_NODE>36</MAX_CPUTASKS_PER_GPU_NODE>
<GPU_TYPES>v100,a100</GPU_TYPES>
Collaborator:

Shall we specify the gpu_offload variable as well? For machines with AMD GPUs, openacc and combined may not be valid inputs.

Collaborator Author:

How do we handle Fortran do concurrent here?

Collaborator:

This is a great question! In this case, combined may be a confusing word, too. Is there any plan to introduce do concurrent to CAM in the future?

<MAX_MPITASKS_PER_NODE>128</MAX_MPITASKS_PER_NODE>
<MAX_CPUTASKS_PER_GPU_NODE>64</MAX_CPUTASKS_PER_GPU_NODE>
<GPU_TYPES>a100</GPU_TYPES>
Collaborator:

The same question about gpu_offload variable.

<MAX_MPITASKS_PER_NODE>128</MAX_MPITASKS_PER_NODE>
<MAX_CPUTASKS_PER_GPU_NODE>64</MAX_CPUTASKS_PER_GPU_NODE>
<GPU_TYPES>a100</GPU_TYPES>
<PROJECT_REQUIRED>TRUE</PROJECT_REQUIRED>
<mpirun mpilib="default">
<executable>mpiexec</executable>
Collaborator:

At line 1838, shall we do a purge first before loading any modules?

Collaborator Author:

This is a feature of modules on Gust (and soon Derecho): the two modules loaded above the purge are sticky and are not affected by the purge command.

Collaborator:

Thanks for the clarification. That is clear to me now.

<submit_args>
<argument> -l gpu_type=$GPU_TYPE </argument>
</submit_args>
<directives queue="casper" compiler="nvhpc" gpu_enabled="true">
Collaborator:

I know this works, but I am not sure how gpu_enabled actually works. Is it an XML variable defined somewhere? And how is it set to true or false during the build? A brief explanation would be very helpful.

Collaborator Author:

This is done here: https://github.com/jedwards4b/cime/blob/add_gpu_gust/CIME/case/case.py#L457

gpu_enabled is an attribute of the case object and is set to true if GPU_TYPE is set to a valid value.
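As a rough sketch of the idea described above (the function name is hypothetical; the real logic lives in the linked case.py):

```python
def resolve_gpu_enabled(gpu_type):
    """Hypothetical helper: gpu_enabled is true only when GPU_TYPE
    is set to a usable value (i.e. not empty and not "none")."""
    return bool(gpu_type) and str(gpu_type).lower() != "none"

# Batch directives such as <directives ... gpu_enabled="true"> can then
# be matched against this flag when the job script is generated.
```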

Collaborator:

Thanks @jedwards4b for the details. That is very helpful!

@jedwards4b jedwards4b closed this Oct 17, 2023